
refactor: use Parquet metadata for row counts in validate#39

Merged
shaypal5 merged 2 commits into main from parquet-metadata-row-counts on May 1, 2026

Conversation

@shaypal5 (Contributor) commented on May 1, 2026

Summary

  • Replace pd.read_parquet() with pq.read_metadata().num_rows in _check_task_splits() for row count validation — avoids loading full Parquet data
  • Replace pd.read_parquet(path, columns=[]) with pq.read_schema().names in _check_leakage() for column name checks — reads only the schema footer
  • Add 3 new tests: metadata/data consistency, task split row count mismatch, leakage column detection via schema

Note: _check_tables() still loads full DataFrames because they are needed downstream for FK integrity checks.

Closes #17

Test plan

  • All 757 tests pass
  • Ruff lint clean
  • New test test_task_split_row_count_mismatch verifies error detection via metadata
  • New test test_leakage_detects_extra_columns verifies column detection via pq.read_schema()
  • New test test_task_split_metadata_matches_data verifies metadata row counts match actual data

🤖 Generated with Claude Code

Replace full-file reads with pyarrow metadata reads in bundle validation:
- _check_task_splits: pq.read_metadata().num_rows instead of pd.read_parquet()
- _check_leakage: pq.read_schema().names instead of pd.read_parquet(columns=[])

Closes #17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 1, 2026 16:07
@shaypal5 added the labels "type: refactor" (Code change with no behavior difference) and "layer: validation" (validation/ invariants and checks) on May 1, 2026

Copilot AI left a comment


Pull request overview

Refactors bundle validation to use Parquet footer metadata/schema for lightweight checks, reducing memory usage and speeding up validate on larger bundles.

Changes:

  • Use pyarrow.parquet.read_metadata(...).num_rows for task split row-count validation instead of loading full DataFrames.
  • Use pyarrow.parquet.read_schema(...).names for leakage column detection instead of pd.read_parquet(..., columns=[]).
  • Add tests covering metadata/data consistency, task split row-count mismatch detection, and leakage detection via schema.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File | Description
leadforge/validation/bundle_checks.py | Switches task split row-count checks to Parquet metadata and leakage checks to Parquet schema reads.
tests/validation/test_bundle_checks.py | Adds regression tests for metadata row counts, task split row mismatch detection, and schema-based leakage detection.
.agent-plan.md | Documents completion of the Parquet-metadata validation refactor and associated tests.


Comment thread leadforge/validation/bundle_checks.py Outdated
Comment thread leadforge/validation/bundle_checks.py Outdated
- Only call pq.read_metadata() when manifest has expected row count
  (avoids unnecessary I/O for partial manifests)
- Replace pyarrow-vs-pandas consistency test with monkeypatch test
  proving _check_task_splits never calls pd.read_parquet

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
github-actions (bot) commented on May 1, 2026

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #39 in repository https://github.com/leadforge-dev/leadforge. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25223507963 attempt 1
Comment timestamp: 2026-05-01T16:51:11.951535+00:00
PR head commit: 052ee9e010d3e6310288a6a184323b13bb5e6f42

@shaypal5 merged commit 2ae3784 into main on May 1, 2026
7 checks passed
@shaypal5 deleted the parquet-metadata-row-counts branch on May 1, 2026 17:52

Labels

layer: validation (validation/ invariants and checks)
type: refactor (Code change with no behavior difference)

Development

Successfully merging this pull request may close these issues.

perf: use Parquet metadata for row counts in validate command

2 participants